Feb 11, 2022

Objectives

  • Load microarray dataset into R
  • Explore the dataset with basic visualizations
  • Identify differentially expressed genes (DEGs)
  • Generate annotation of the DEGs (Tentative)

The Central Dogma of Biology

DNA makes RNA and RNA makes protein

Cleft Lip and Palate 1/3

Cleft lip and cleft palate (CLP) are splits in the upper lip, the roof of the mouth (palate) or both. They result when facial structures that are developing in an unborn baby do not close completely. CLP is one of the most common birth defects with a frequency of 1/700 live births.

Cleft lip and palate

Cleft Lip and Palate 2/3

Children with cleft lip with or without cleft palate face a variety of challenges, depending on the type and severity of the cleft.

  • Difficulty feeding. One of the most immediate concerns after birth is feeding.

  • Ear infections and hearing loss. Babies with cleft palate are especially at risk of developing middle ear fluid and hearing loss.

  • Dental problems. If the cleft extends through the upper gum, tooth development may be affected.

  • Speech difficulties. Because the palate is used in forming sounds, the development of normal speech can be affected by a cleft palate. Speech may sound too nasal.

Reference: Mayo Foundation for Medical Education and Research

Cleft Lip and Palate 3/3

  • DNA variation in Interferon Regulatory Factor 6 (IRF6) causes Van der Woude syndrome (VWS)

  • VWS is the most common syndromic form of cleft lip and palate.

  • However, the causing variant in IRF6 has been found in only 70% of VWS families!

  • IRF6 is a transcription factor with a conserved helix-loop-helix DNA binding domain and a less well-conserved protein binding domain.

Reference: Hum Mol Genet. 2014 May 15; 23(10): 2711–2720

Question

Given:

  1. The pathogenic variant in IRF6 exists in only 70% of the VWS families

  2. IRF6 is a transcription factor

How can we identify other genes that might be involved in the remaining 30% of the VWS families?

Hint

  • Usually, genes that are regulated by a transcription factor belong to the same biological process or pathway.

  • Therefore, by comparing the gene expression patterns between wild-type (functional) Irf6 and knockout (non-functional) Irf6, it could be possible to identify genes that are regulated (targeted) by Irf6.

Hypothesis

  • \(H_O : \mu_{WT} = \mu_{KO}\)

  • \(H_A : \mu_{WT} \ne \mu_{KO}\)

  • Where \(\mu\) is the mean of the gene expression values of a gene.

  • One-sided or Two-sided testing?

Why Microarray?

Why Microarray?

  • No need for candidate genes (or genes of interest)

  • One experiment assesses the entire transcriptome

  • One experiment generates many hypotheses

  • Only small amount of RNA is required (~15–200 ng)

Original Paper

PMID: 17041601

Experimental Design

  • 3 IRF6 wild-type (+/+) and 3 knockout (-/-) mouse embryos.
  • E17.5 embryos were removed from euthanized mothers.
  • Skin was removed from embryos.
  • Total RNA was isolated from the skin.
  • Resultant RNA was hybridized to Affymetrix GeneChip Mouse Genome 430 2.0 arrays.

Dataset

  • The original dataset can be obtained from NCBI GEO with accession GSE5800
ID KO1 KO2 KO3 WT1 WT2 WT3
1415670_at 6531.0 5562.8 6822.4 7732.1 7191.2 7551.9
1415671_at 11486.3 10542.7 10641.4 10408.2 9484.5 7650.2
1415672_at 14339.2 13526.1 14444.7 12936.6 13841.7 13285.7
1415673_at 3156.8 2219.5 3264.4 2374.2 2201.8 2525.3

Loading

First, we are going to load the dataset from the .tsv file into R as a variable called data using the read.table function.

Note: .tsv stands for tab-separated values, which is simply a plain text file. The file itself can downloaded from here.

In the following code, data is just an arbitrary variable name to hold the result of read.table. See here for valid variable names in R. Also it is recommended to see here for generally Good Variable Names.

Loading

# Load the data from a file into a variable
data = read.table("https://media.githubusercontent.com/media/ahmedmoustafa/gene-expression-datasets/main/datasets/irf6/irf6.tsv", header = TRUE, row.names = 1)

# Convert the data.frame (table) in a matrix (numeric)
data = as.matrix(data)

Note: the hash sign (#) indicates that what comes after is a comment. Comments are for documentation and readability of the code and they are not evaluated (or executed).

Checking

dim(data) # Dimension of the dataset
## [1] 45101     6
head(data) # First few rows
KO1 KO2 KO3 WT1 WT2 WT3
1415670_at 6531.0 5562.8 6822.4 7732.1 7191.2 7551.9
1415671_at 11486.3 10542.7 10641.4 10408.2 9484.5 7650.2
1415672_at 14339.2 13526.1 14444.7 12936.6 13841.7 13285.7
1415673_at 3156.8 2219.5 3264.4 2374.2 2201.8 2525.3
1415674_a_at 4002.0 3306.9 3777.0 3760.6 3137.0 2911.5
1415675_at 3468.4 3347.4 3332.9 3073.5 3046.0 2914.4

Number of Genes and IDs

number_of_genes = nrow(data) # number of genes = number of rows
number_of_genes
## [1] 45101
ids = row.names(data) # The ids of the genes are the names of the rows
head(ids)
## [1] "1415670_at"   "1415671_at"   "1415672_at"   "1415673_at"   "1415674_a_at"
## [6] "1415675_at"

Exploring

Check the behavior of the data (e.g., normal?, skewed?)

hist(data, col = "gray", main="Histogram")

Transforming

\(log_2\) transformation (why?)

data2 = log2(data)
hist(data2, col = "gray")

Boxplot

colors = c(rep("navy", 3), rep("orange", 3))
boxplot(data2, col = colors)

Clustering

Hierarchical clustering of the samples (i.e., columns) based on the correlation coefficients of the expression values

hc = hclust(as.dist(1 - cor(data2)))
plot(hc)

To learn more about a function (e.g., hclust), you may type ?function (e.g., ?hclust) in the console to launch R documentation on that function.

Splitting Data Matrix into two tables

KO

ko = data2[, 1:3] # KO matrix
head(ko)
KO1 KO2 KO3
1415670_at 12.67309 12.44160 12.73606
1415671_at 13.48763 13.36396 13.37740
1415672_at 13.80768 13.72346 13.81825
1415673_at 11.62425 11.11602 11.67260
1415674_a_at 11.96651 11.69126 11.88303
1415675_at 11.76005 11.70883 11.70256

Splitting Data Matrix into two tables

WT

wt = data2[, 4:6] # WT matrix
head(wt)
WT1 WT2 WT3
1415670_at 12.91664 12.81202 12.88262
1415671_at 13.34543 13.21136 12.90128
1415672_at 13.65917 13.75673 13.69759
1415673_at 11.21323 11.10447 11.30224
1415674_a_at 11.87675 11.61517 11.50755
1415675_at 11.58567 11.57270 11.50898

Gene (Row) Mean Expression

# Compute the means of the KO samples
ko.means = rowMeans(ko)
head(ko.means)
##   1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
##     12.61692     13.40966     13.78313     11.47096     11.84693     11.72381
# Compute the means of the WT samples
wt.means = rowMeans(wt)
head(wt.means)
##   1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
##     12.87043     13.15269     13.70450     11.20664     11.66649     11.55578

Scatter between the means

plot(ko.means ~ wt.means) # The actual scatter plot
abline(0, 1, col = "red") # Only a diagonal line

Differential Gene Expression Analysis

To identify Differentially Expressed Genes (DEGs), we will identify:

  • Biologically significantly differentially expressed
  • Statistically significantly differentially expressed

Then, we will take the overlap (intersection) of the two sets

Biological Significance (fold-change)

fold = ko.means - wt.means # Difference between means
head(fold)
##   1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
##  -0.25351267   0.25697097   0.07863227   0.26431191   0.18044345   0.16803065
  • What do the positive and negative values of the fold-change indicate? Considering the WT condition is the reference (or control)

  • +ve fold-change \(\rightarrow\) Up-regulation \(\uparrow\)

  • -ve fold-change \(\rightarrow\) Down-regulation \(\downarrow\)

Biological Significance (fold-change)

hist(fold, col = "gray") # Histogram of the fold

Statistical Significance (p-value)

  • To assess the statistical significance of the difference in the expression values for each gene between the two conditions (e.g., WT and KO), we are going to use t-test.

t-test

Let’s say there are two samples x and y from the two populations, X and Y, respectively, to determine whether the means of two populations are significantly different, we can use t.test.

t-test : Example 1

x = c(4, 3, 10, 7, 9) ; y = c(7, 4, 3, 8, 10)
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.1066, df = 7.9743, p-value = 0.9177
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.12888  4.52888
## sample estimates:
## mean of x mean of y 
##       6.6       6.4
t.test(x, y)$p.value
## [1] 0.917739

t-test : Example 2

x = c(6, 8, 10, 7, 9) ; y = c(3, 2, 1, 4, 5)
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 5, df = 8, p-value = 0.001053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.693996 7.306004
## sample estimates:
## mean of x mean of y 
##         8         3
t.test(x, y)$p.value
## [1] 0.001052826

Let’s compute the p-value for all genes using a for-loop of t.test, one gene at a time:

pvalue = NULL # Empty list for the p-values

for(i in 1 : number_of_genes) { # for each gene from to the number of genes
  x = wt[i, ] # wt values of gene number i
  y = ko[i, ] # ko values of gene number i
  t = t.test(x, y) # t-test between the two conditions
  pvalue[i] = t$p.value # Store p-value number i into the list of p-values
}
head(pvalue)
## [1] 0.092706280 0.182663337 0.129779075 0.272899180 0.262377176 0.005947807
hist(-log10(pvalue), col = "gray") # Histogram of p-values (-log10)

Volcano

plot(-log10(pvalue) ~ fold)

fold_cutoff = 2
pvalue_cutoff = 0.01

plot(-log10(pvalue) ~ fold)

abline(v = fold_cutoff, col = "blue", lwd = 3)
abline(v = -fold_cutoff, col = "red", lwd = 3)
abline(h = -log10(pvalue_cutoff), col = "green", lwd = 3)

Filtering for DEGs

filter_by_fold = abs(fold) >= fold_cutoff # Biological
sum(filter_by_fold) # Number of genes staisfy the condition
## [1] 1051
filter_by_pvalue = pvalue <= pvalue_cutoff # Statistical
sum(filter_by_pvalue)
## [1] 1564
filter_combined = filter_by_fold & filter_by_pvalue # Combined
sum(filter_combined)
## [1] 276
filtered = data2[filter_combined, ]
dim(filtered)
## [1] 276   6
head(filtered)
KO1 KO2 KO3 WT1 WT2 WT3
1416200_at 13.312004 12.973357 12.868456 7.40429 8.558803 8.683696
1416236_a_at 14.148397 14.039236 14.130007 12.23604 12.022403 11.495055
1417808_at 5.321928 5.442944 4.053111 15.16978 15.070087 14.753274
1417932_at 10.602884 10.257152 10.496055 13.98445 14.203295 13.720960
1418050_at 10.622052 10.975490 10.795066 12.86513 13.012048 12.658122
1418100_at 9.117903 8.634811 9.057721 12.90358 12.842449 12.233769
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_combined]) ~ fold[filter_combined],
       col = "green")

Exercise

On the volcano plot, highlight the up-regulated genes in red and the download-regulated genes in blue

Click here for the solution
  • Up-regulated genes
# Screen for the up-regulated genes (+ve fold)
filter_up = filter_combined & fold > 0

head(filter_up)
##   1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE
# Number of filtered genes
sum(filter_up)
## [1] 95
  • Down-regulated genes
# Screen for the down-regulated genes (-ve fold)
filter_down = filter_combined & fold < 0

head(filter_down)
##   1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
##        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE
# Number of filtered genes
sum(filter_down)
## [1] 181
plot(-log10(pvalue) ~ fold)
points(-log10(pvalue[filter_up]) ~ fold[filter_up], col = "red")
points(-log10(pvalue[filter_down]) ~ fold[filter_down], col = "blue")

Heatmap

heatmap(filtered)

Heatmap

  • By default, heatmap clusters genes (rows) and samples (columns) based on the Euclidean distance.

  • In the context of gene expression, we need to cluster genes and samples based on the correlation to explore patterns of co-regulation (co-expression) - Guilt by Association.

  • To let heatmap cluster the genes and/or samples, the genes and samples will be clustered (grouped) by correlation coefficients (using cor) among the genes and samples.

Heatmap

# Clustering of the columns (samples)
col_dendrogram = as.dendrogram(hclust(as.dist(1-cor(filtered))))

# Clustering of the rows (genes)
row_dendrogram = as.dendrogram(hclust(as.dist(1-cor(t(filtered)))))

Heatmap

# Heatmap with the rows and columns clustered by correlation coefficients
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram)

Heatmap

library(gplots) # Load the gplots library
heatmap(filtered, Rowv=row_dendrogram, Colv=col_dendrogram, col = rev(redgreen(1024)))

Annnotation

To obtain the functional annotation of the differentially expressed genes, we are going first to extract their probe ids:

filtered_ids = row.names(filtered) # ids of the filtered DE genes
length(filtered_ids)
## [1] 276
head(filtered_ids)
## [1] "1416200_at"   "1416236_a_at" "1417808_at"   "1417932_at"   "1418050_at"  
## [6] "1418100_at"
write.table (filtered_ids, "results/filtered_ids.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

Using DAVID annotation database

Using BioConductor annotation packages

Alternatively, we can generate a comprehensive functional annotation via BioConductor packages annaffy and mouse4302.db.

If the packages are not already installed, they can be installed using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("annaffy", "mouse4302.db"))

Annnotation

Next the annotation can be obtained by loading the library:

library(annaffy)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: BiocManager
## Loading required package: GO.db
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:gplots':
## 
##     space
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
## 

Annnotation

Then generating and exporting the annotation into HTML format:

annotation_table = aafTableAnn(filtered_ids, "mouse4302.db")
## Loading required package: mouse4302.db
## Loading required package: org.Mm.eg.db
## 
## 
saveHTML(annotation_table, file="results/filtered_annotation.html")
browseURL("results/filtered_annotation.html")

Annnotation

As an HTML file, the annotation table can now be opened in any standard browser.

For a further focused functional analysis, let’s split the DEGs according to the regulation pattern (up-regulated vs. down-regulated) to determine which functions/pathways/processes turned on or off from WT to KO:

Up-regulated Genes

filtered_up_ids = ids[filter_up]
length (filtered_up_ids)
## [1] 95
write.table (filtered_up_ids, "results/filtered_up_ids.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
annotation_table = aafTableAnn(filtered_up_ids, "mouse4302.db")
saveHTML(annotation_table, file="results/filtered_up_annotation.html")
browseURL("results/filtered_up_annotation.html")

Here is the functional annotation of the up-regulated genes

Down-regulated Genes

filtered_down_ids = ids[filter_down]
length (filtered_down_ids)
## [1] 181
write.table (filtered_down_ids, "results/filtered_down_ids.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
annotation_table = aafTableAnn(filtered_down_ids, "mouse4302.db")
saveHTML(annotation_table, file="results/filtered_down_annotation.html")
browseURL("results/filtered_down_annotation.html")

Here is the functional annotation of the down-regulated genes

Sanity Check (Irf6)

Down Regulation of Irf6

Sanity Check (Irf6)

Multiple Testing Correction

We conducted 45,101 statistical tests. The computed p-values should be corrected for multiple testing. The correction can be performed using p.adjust, which simply takes the original p-values as a vector and returns the adjusted (corrected) p-values using the False Discovery Rate adjustment method:

adjusted.pvalues = p.adjust(pvalue, method = "fdr")

The number of the original p-values \(\leq\) 0.05 is 5,099 while the number of adjusted p-values \(\leq\) 0.05 is 9.

Multiple Testing Correction

Here is a sample of the original p-values and their corresponding adjusted p-values:

pvalue adjusted.pvalue
0.0927063 0.5278755
0.1826633 0.6346918
0.1297791 0.5805456
0.2728992 0.7025472
0.2623772 0.6967834
0.0059478 0.2518079

Multiple Testing Correction

And here is the overall relationship between the original and the adjusted p-values:

Note: Formally, we should have performed all p-value-based analyses (filtering and downstream steps) based on the adjusted p-values, but for the sake of simplicity, we proceeded with the original p-values instead of the adjusted p-values.

Thank you!

Funny Genes